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Genomic duplication-divergence events, which are the primary 
source of new protein functions, occur stochastically at a wide 
range of genomic scales, from single gene to whole genome du- 
plications. Clearly, this fundamental evolutionary process must 
have largely conditioned the emerging structure of protein- 
protein interaction (PPI) networks, that control many cellu- 
lar activities. We propose and asymptotically solve a general 
duplication-divergence model of PPI network evolution based 
on the statistical selection of duplication-derived interactions. 
We also introduce a conservation index, that formally defines 
the statistical evolutionary conservation of PPI networks. Dis- 
tinct conditions on microscopic parameters are then shown to 
control global conservation and topology of emerging PPI net- 
works. In particular, conserved, non-dense networks, which are 
the only ones of potential biological relevance, are also shown 
to be necessary scale-free. 

Introduction 

The primary source of new protein functions is generally 
considered to originate from duplication of existing genes fol- 
lowed by functional divergence of their duplicate copies [1,2]. 
In fact, duplication-divergence events occur at a wide range 
of genomic scales, from many independent duplications of 
individual genes to rare but evolutionary dramatic duplica- 
tions of entire genomes. For instance, there were between 
2 and 4 consecutive whole genome duplications in all ma- 
jor eukaryote kingdoms in the last 500MY, about 15% of 
life history (see refs in [3]). Extrapolating these "recent" 
records, one roughly expects a few tens consecutive whole 
genome duplications (or equivalent "doubling events" ) since 
the origin of life [3] . 

Clearly, this succession of whole genome duplications, 
together with the accumulation of individual gene duplica- 
tions, must have greatly contributed to shape the global 
structure of large biological networks, such as protein- 
protein interaction (PPI) networks, that control cellular ac- 
tivities. 

Ispolatov et al. [4], recently proposed an interesting 
local duplication-divergence model of PPI network evolu- 
tion based on natural selection at the level of individ- 
ual duplication-derived interactions and ii) a, time-linear in- 
crease in genome and PPI network sizes. Yet, we expect that 
independent local duplications and, a forceriori, partial or 



whole genome duplications all lead to exponential evolution- 
ary dynamics of PPI networks (as typically assumed at the 
scale of entire ecosystems). In the long time limit, expo- 
nential dynamics should outweigh all time-linear processes 
that have been assumed in PPI network evolution models 
so far [4-11]. 

This paper proposes and asymptotically solves a general 
duplication-divergence model based on prevailing exponen- 
tial dynamic^ of PPI network evolution under local, partial 
or global genome duplications. Our aim, here, is to establish 
a theoretical baseline from which other evolutionary pro- 
cesses beyond strict duplication-divergence events, such as 
shuffling of protein domains [3] or horizontal gene transfers, 
can then be considered. 



Results 

General Duplication-Divergence Model 

The general duplication-divergence (GDD) model is de- 
signed to capture large scale properties of PPI networks 
arising from statistical selection at the level of duplication- 
derived interactions, which we see as a spontaneous "back- 
ground" dynamics for PPI network evolution. 

At each time step, a fraction q of extant genes is dupli- 
cated, followed by functional divergence between duplicates. 
Fig. 1. In the following, we first solve the GDD model as- 
suming that q is constant over evolutionary time scales. We 
then study more realistic scenarios combining, for instance, 
rare whole genome duplications (g = 1) with more frequent 
local duplications of individual genes {q <^ 1), and including 
also stochastic fluctuations in all microscopic parameters of 
the GDD model (see Fig. 1 and below). 

Natural selection is modeled statistically {i.e., regar- 
less of specific evolutionary advantages) at the level of 
duplication-derived interactions. We assume that ancient 
and recent duplication-derived interactions are stochasti- 
cally conserved after each duplication with distinct prob- 
abilities 7ij's, depending only on the recently duplicated or 
non-duplicated state of each protein partners, as well as 
on the asymmetric divergence between gene duplicates [3], 
see Fig. 1 caption ('s' for "singular", non-duplicated genes 
and 'o'/^n' for "old" /"new" asymmetrically divergent dupli- 
cates). Hence, the GDD model depends on 1-1-6 parameters. 
I.e., q plus 6 7's (7^3, 7so 'Jsn, loo, 'Jon and 7„„). This pa- 



^ Results from the time-linear duplication-divergence model [4] are recovered as a special limit, sec Supporting Information. 
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Figure 1: General Duplication-Divergence Model for protein-protein interaction network evolution. Successive dupli- 
cations of a fraction q of genes are followed by an asymmetric divergence of gene duplicates {e.g. 2 vs 2'): "New" duplicates (n) are left 
essentially free to accumulate neutral mutations with the likely outcome to become nonfunctional and eventually deleted unless some 
"new", duplication- derived interactions are selected; "Old" duplicates (o), on the other hand, are more constrained to conserve "old" 
interactions already present before duplication. Links on the locally (q <S 1), partially (q < 1) or fully (q = 1) duplicated network are 
then preserved stochastically with different probabilities ■yij (0 < "fij < 1, i,j = s,o,n) reflecting the recent history of each interacting 
partners, that are either "singular", non-duplicated genes (s) or recently duplicated genes undergoing asymmetric divergence (o/n). 
The GDD model has 6 7ij parameters, which reduce to 3 for local {q ^ 1) and whole genome (g = 1) duplications (sec text). 



rameter space greatly simplifies, however, for two limit evo- 
lutionary scenarios of great biological importance: i) local 
duplications (g <C 1), controlled by 7^3, and 7s„, and 
a) whole genome duplications {q — 1), controlled by 700, 
7o„ and 7 

We study the GDD evolutionary dynamics of PPI net- 
works in terms of ensemble averages (Q'"') defined as the 
mean value of a feature Q over all realizations of the evolu- 
tionary dynamics after n successive duplications. This does 
not imply, of course, that all network realizations "coex- 
ist" but only that a random selection of them are reason- 
ably well characterized by the theoretical ensemble average. 
While generally not the case for exponentially growing sys- 
tems, we can show, here, that ensemble averages over all 
evolutionary dynamics indeed reflect the properties of typi- 
I 



cal network realizations for biologically relevant regimes (see 
Statistical properties of GDD models in Supp. Information). 

In the following, we focus the discussion on the num- 
ber of proteins (or "nodes") A^^ of connectivity k in PPI 
networks, while postponing the analysis of GDD models for 
simple non-local motifs to the end of the paper and Support- 
ing Information. The total number of nodes in the network 
is noted N = X^j,>o Nk and the total number of interactions 
(or "links") L = X^j.>q kNk/2. The dynamics of the ensem- 
ble averages (A'^^"') after n duplications is analyzed using a 
generating function, 

F("'(x) = ^{iV("')a;\ (1) 

fe>0 



The evolutionary dynamics of F^^' {x) correponds to the following recurrence deduced from the microscopic definition 
of the GDD model (see Supporting Information), 

f("+^)(x) = (l-g)7^("'(yl,(a;)) -HgF("'(^o(a;)) -f gF("'(yl„(x)) (2) 



where we note for i = s, o, n, 

Ai{x) = {l — q){piisX + 5is) + q{jioX + Sio){-yinX + Sin) (3) 

with jij = "fji and 5ij — 1 — 'yij corresponding to deletion 
probabilities {i,j = s,o,n). The average growth/decrease 
rate of connectivity for each type of nodes corresponds 
to A^(l) {i.e., degree k^kVi on average for node i — s,o,n), 

P. = A'iil) = (l-g)7,. + g(7.o + Hn) (4) 

In the following, we assume To > Pn by definition of "old" 
and "new" duplicates due to asymmetric divergence. 

^Note, however, that pseudogenes may still have a critical role 
adjacent genes. This supports a view of PPI network evolution in 



Evolutionary growth and conservation of PPI network 

The total number of nodes generated by the GDD model, 
F'"'(1), growths exponentially with the number of partial 
duplications, ^("-'(1) = C-(l + g)", where C is the ini- 
tial number of nodes, as a constant fraction of nodes g 
is duplicated at each time step. Yet, some nodes become 
completely disconnected from the rest of the graph during 
divergence and rejoin the disconnected component of size 
-F''"'(0). From a biological point of view, these disconnected 
nodes represent genes that have presumably lost all biolog- 
ical functions and become pseudogenes before being simply 
eliminated from the genome. We neglect the possibility for 

in evolution by providing functional domains that can be fused to 
terms of protein domains instead of entire proteins. Yet, it can be 
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nonfunctional genes to reconvert to functional genes again 
after suitable mutations, and remove them at each round of 
partial duplicatioijf], focussing solely on the connected part 
of the graph. 

In particular, the link growth rate (1 — q)rs + qTo + q^n , 
obtained by taking the first derivative of ((21) at s = 1, con- 
trols whether the connected part of the graph is exponen- 
tially growing (> 1) or shrinking (< 1). 

Let us now introduce another rate of prime biological 
interest, AI — {1 — q)Ts + qFo- It is the average rate of con- 
nectivity increase for the most conserved duplicate lineage, 
which corresponds to a stochastic alternance between singu- 
lar ('s') and most conserved ('o') duplicate descents. Hence, 
M = {1 — q)Ts + qTo can be seen as a network conservation 
index, since individual proteins in the network all tend to be 
conserved if M > 1, while non-conserved PPI networks arise 
from continuous renewing of nodes and local topologies, if 
M = (1 - q)r, -f gr„ < 1 (and (1 - q)r, + qTo + qr„ > 1 to 
ensure a non- vanishing connected network). Clearly, non- 
vanishing and conserved graphs seem the only networks of 
potential biological interest (see Discussion). The resulting 
conditions on GDD model parameters are summarized in 
Fig. 2. In particular, H-g— (1 — g)^(l — 7^3) > 1, implying 
7ss > 1 — Q, in the local duplication limit, q <^ 1. 

Evolution of PPI network degree distribution 

In practice, we rescale the exponentially growing connected 
graph by introducing a normalized generating function for 
the average degree distribution, 

(n.) / \ M k (n) {^k ') /r-\ 

p'- >{x) = ^pI'x with p)^, ' = — ■, (5) 

fc>l ^ ' 
I 



While A'"' is not known a priori and should, in gen- 
eral, be determined self-consistently with p^"'\x) itself, it 
is directly related to the evolution of the mean degree 
fc'"' = X/fc>i '^Pfc"' obtained by taking the first derivative of 
at 2; = I, 

fc'""^" ^ (1 - g)r. + gr„ + qr„ 

Hence, although connected networks grow exponen- 
tially both in terms of number of links (link growth rate 
{l — q)Ts+qro-\-qI'n) and number of connected nodes (node 
growth rate A'"'), features normalized over these growing 
networks, such as node mean connectivity ([9} or distribu- 
tions of node degree (or simple non-local motifs, see below) 
exhibit richer evolutionary dynamics in the asymptotic limit 
n^oo, as we now discuss. 




Figure 2: Evolutionary growth and conservation of PPI 
networks. Phase diagram of GDD models for local (blue, 5 <C 1, 
7ss = 1), partial (black, q < 1) and whole genome (red, q = I) 
duplications, in the ^(1— (j)ra -|- gFo -|-(?r„, (1— q)r3 -t-gPo) plane. 

where {N^"^) = Efe>i(^fc"')' after removing (A^^"'). 

F("'(x) can be reconstructed from the shifted degree distri- 
bution, p("'(x) =p^"\x) - 1, as, 

F'^"\x) = {Ar("^)p("'(s) + C • (1 + q)", (6) 

which yields the following recurrence for p("'( x). 



(7) 



(8) 

I 

Asymptotic analysis of node degree distribution 

The node degree distribution can be shown (see Supp. In- 
formation) to converge towards a limit function p{x), with 
p{x) — p{x) — 1 solution of the functional eq.([7ll 

{l^q)p{As{x))+qp{Aoix))+qp{Anix)) 
p{x) = (10) 

where A = lim„^oo A'"' with both A < 1 + the maxi- 
mum node growth rate, and A < (1 — q)rs -f qVo -\- (?r„, the 
link growth rate, as the number of connected nodes can- 
not increase faster than the number of links. Asymptotic 
regimes with A = (1 — q)rs -\- qTo -\- qTn correspond to the 
same exponential growth of the network in terms of con- 
nected nodes and links, and will be referred to as linear 
regimes, hereafter, while A < (1 — q)Ts -\- qVo -\- qVn cor- 
responds to non-linear asymptotic regimes, which imply a 



, {l-q)p^">{As{x))+qp^"\Ao{x))+qp^"\A4x)) 

P = AW 

where A^"^ is the ratio between two consecutive graph sizes in terms of connected nodes, 

A^"' = -^^J^ = -(l-<?)p("'(A(0)) -gp("'(A„(0)) -qp("'(A.(0)) >0 



shown [3] that extensive shuffling of protein domains does not actually change the general scale-free structure of PPI networks. 
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diverging mean connectivity fc^ ' ^ oo in the asymptotic 
limit n ^ oo, Eq.([9ll. 

In order to determine A and p{x) self-consistently, we 
first express successive derivatives of p{x) at a: = 1 in terms 
of lower derivatives, using Ea. (|10|) . 



(1 - g)rj + grS + gr;; 



! = [fc/21 



(11) 

where Ok.i are positive functions of the 1+6 parameters. 
Inspection of this expression readily defines two classes of 
asymptotic regimes, regular and singular regimes, which 
can be further analyzed with the "characteristic function" 
h{a) = (1 — g)r5 + qF" + gF", as outlined below and in 
Fig. 3 (see Asymptotic methods in Supp. Information for 
proof details). 
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Figure 3: Asymptotic degree distribution for GDD mod- 
els. Asymptotic regimes are deduced from the convex character- 
istic function h{a) and its derivatives h'{0) and h'(l) (see text). 

Regular regimes, if M' = maxi(ri) < 1, for i = s,o,n. In 
this case, the only possible solution is A = /i(l) (i.e. linear 
regime). Hence, since M' < 1, h{l) > h{k) and successive 
derivatives 9jp(l) are thus finite and positive for all fc > 1. 
This corresponds to an exponential decrease of the node 
degree distribution for fc ^ 1, oc e^''''' with a power law 
prefactor. The limit average connectivity ((9)1 is finite in this 
case, k < oo. 

Singular regimes, if M' = maxi(Fi) > 1, for i = s,o,n. In 
this case, Eg. ljlll) suggests that there exists an integer r > 1 
for which the rth-derivative is negative, 9Jp(l) < 0, which 
is impossible by definition. This simply means that neither 
this derivative nor any higher ones exist (for k > r). We 
thus look for self-consistent solutions of the "characteristic 
equation" h{a) — A, (with r — 1 < a < r) corresponding to 
a singularity of p{x) at a; = 1 and a power law tail of pk, for 
fc> 1 [12], 



p{x) 



AJl 



+ ■ 



and Pk oc k 



(12) 



where the singular term (1— a;)° is replaced by (1— a;)"" \n{l—x) 
for a = r exactly. Several asymptotic behaviors are pre- 
dicted from the convex shape of h{oi) {d^h > 0), depending 
on the signs of its derivatives h'{0) and h'{l), Fig. 3 (inset). 



• If h'{0) < and h'{l) < 0. There exists an a* > 1 
so that h{a*) = h{l) and the condition A < h{l) 
implies a* > a > 1. The solution a — 1 requires 
fe'(l) = and should be rejected in this case. Hence, 
since k < oo for a > 1, we must have A — h{l) (lin- 
ear regime) and a scale-free limit degree distribution 
with a unique a = a* >1, pk x ~ for fc ^ 1. 

• If h'{0) < and h'{l) = 0. a = 1, A = h{l) and 
Pk cx k~^ for A: ^ 1 (A:'"' oo as n oo). 

• If h'{0) < and h'{l) > 0. The general condition 
A < min(/i(0), /i(l)) leads a priori to a whole range 
of possible a G]0, 1] corresponding to stationary scale- 
free degree distributions with diverging mean degrees 
A;'"' oo. Yet, numerical simulations suggest that 
there might still be a unique asymptotic node growth 
rate A regardless of initial conditions or evolution tra- 
jectories, although convergence is extremely slow (See 
Numerical simulations in Supp. Information). 

• If h'(0) > and h'{l) > 0. A = h{0) = 1 + g imply- 
ing that all duplicated nodes are selected in this case. 
No suitable a exist as the node degree distribution is 
exponentially shifted towards higher and higher con- 
nectivities. This is a dense, non-stationary regime 
with seemingly little relevance to biological networks. 

Finally, note that the characteristic equation A — h{a) can 
be recovered directly from the average change of connectiv- 
ity k — > kVi and the following continuous approximation 
(using iV'"'=Efe^l"^-/„^""^'^'^ and (AT*" V 



{jV("+^)) ^ /((l~ g)iV; 
(ATM) - 



,)dk 



= h{a) 



Local and Global Duplication limits and realistic hy- 
brid models 

We focus here on the biologically relevant cases of grow- 
ing, yet not asymptotically dense networks. Figs. 4A & B 
summarize the asymptotic evolutionary dynamics of the 
GDD model in two limit cases of great biological impor- 
tance: i) for local duplication-divergence events (g ^ 1 and 
^ss = 1, Fig. 4A) and ii) for whole genome duplication- 
divergence events (g — 1, Fig. 4B), see Supp. Information 
for details. 

The local duplication-divergence limit leads to scale- 
free limit degree distributions for both conserved and non- 
conserved networks, with power law exponents 1 < a+1 < 3 
if 7so = 1 (i.e. which ensures that all previous interactions 
are conserved in at least one copy after duplication). 

By contrast, the whole genome duplication-divergence 
limit leads to a wide range of asymptotic behaviors from 
non-conserved, exponential regimes to conserved, scale free 
regimes with arbitrary power law exponents. Conserved, 
non-dense networks require, however, an asymmetric diver- 
gence between old and new duplicates (700 7^ 7nn) [3] and 
lead to scale-free limit degree distributions with power law 
exponents 1 < a -1- 1 < 3 for maximum divergence asymme- 
try (700 = 1 and 7„„ = 0). 
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Figure 4: Asymptotic phase diagram of PPI networks under the GDD model. A. Local duplication-divergence limit (g <C 1 
and 7ss = !)■ B. Whole genome duplication-divergence limit (q = 1). Boxed figures are power law exponents of scale-free regimes. 



We now outline the predictions for a more realistic GDD 
model combining _R— 1 ^ 1 local duplications (q <^ 1) for 
each whole genome duplication (g = 1). This hybrid model 
of PPI network evolution amounts to a simple extension of 
the initial GDD model with fixed q (see Supp. Information). 

Network conservation is now controlled by the cummu- 
lated product of connectivity growth/decrease rates over one 
whole genome duplication and R— 1 local duplications, fol- 
lowing the most conserved, "old" duplicate lineage. 



tween duplicates is still required, in practice, to obtain (con- 
served) scale-free networks. In this case, the asymptotic ex- 
ponent of the hybrid model ah lies between those for purely 
local {a[) and purely global (og) duplications, that are solu- 
tion of h{ae,q) = Ae and h{ag, 1) — Ag, with typical scale- 
free exponents ai + 1, Qg + 1 and, hence, + 1 G [2, 3], for 
k < oo. Analysis of available PPI data is discussed in [3]. 

The previous analysis can be readily extended to any 
duplication-divergence hybrid models with arbitrary series 
of the 1+6 microscopic parameters including stochastic fluc- 



M 



(Voil)- [{l-q)rs{q) + qro(q)]" ^Y^'' (13) tuations {g*"' , 7';' }^ G [0, 1], for i, j = s, o, n. Network con- 
V ^ servation still corresnonds to the condition A-f > 1 . where 



where we note the explicit dependence of Fi in g (i = s, o, n): 
ri{q) = (1— g)7is+g(7io+7i»i). Hence, conserved [resp. non- 
conserved] networks correspond to M > 1 [resp. M < 1]. 

A similar cummulated product also controls the effec- 
tive node degree exponent a and node growth rate A which 
are self-consistent solutions of the characteristic equation. 



(^h{a,l)-[h{a,q)]''' 



l/R 



= A 



(14) 



where we note the explicit dependence of function h{-) for 
a and q: h{a,q) = {l-q)Tf{q) + qV^{q) + qr^{q) as before. 

Hence, the asymptotic degree distribution for the hybrid 
model is controlled by the parameter 



M'= (r„(l)-max(rf-^(g)) 



l/R 



(15) 



with M' > 1 [resp. M' < 1] for scale-free (or dense) [resp. 
exponential] limit degree distribution. In particular, assum- 
ing r,(g) > ro(g), we find M'^ = ro(l)rf-^(g) and thus, 

M'^= r„(i) . 7f.-^(i + li^^"""^ - ^)Y" 

~ ro(l)v^[/l(l,g)]«-i for 7,, = 1, Rq^ < 1 

The square root dependency in terms of cummulated growth 
rate by R—1 local duplications, [h{l, q)]^'^^ , implies that 
non-conserved, exponential regimes for whole genome du- 
plications (if ro(l) < 1) are not easily compensated by lo- 
cal duplications, suggesting that asymmetric divergence be- 



servation still corresponds to the condition A4 > I, where 
the network conservation index now reads. 



M = 



n 

n[< 



(n)Np(n) 



(n)p(n) 



l/R 



(16) 



while the nature of the asymptotic degree distribution is 
controlled by, 

M'= ('nmax(r("')V (17) 

with A/' < 1 corresponding to exponential networks and 
M' > 1 to scale-free (or dense) networks with an effective 
node degree exponent a and effective node growth rate A 
that are self-consistent solutions of the generalized charac- 
teristic equation, 

l/R 



h{a) 



a,g 



(n) 



(18) 



This leads to exactly the same discussion for singular regimes 
as with constant g and Pi (Fig. 3) due to the convexity of 
the generalized function h{a) {d^h(a) > 0, see Supp. Infor- 
mation for details and discussion on the R ^ 00 limit). 
In particular, since (l-g("')ri"'+g("'ri"' < max,(r^'"') 

for all g^"' and F^"' (i — s,o,n), we always have M < M' . 
Hence, the evolution of PPI networks under the most 
general duplication-divergence hybrid model implies that 
all conserved networks are necessary scale-free (or dense) 
(1 < M < M'), while all exponential networks are necessary 
non-conserved {M < M' < 1), see Discussion below. 



5 



Simple non-local PPI network properties 

The generating function approach introduced for node de- 
gree evolution N^"'' (Fig. 5A) can also be applied to simple 
motifs of PPI networks, whose evolutionary conservation is 
also controlled by the same general condition M > 1 (see 
Discussion) . 

We just outline, here, the approach for two simple motifs 
capturing the node degree correlations between 2 interacting 
partners, A'^^"' (Fig- 5B) and 3 interacting partners (trian- 

gles), (Fig. 5C). 






'^k '"k,! ' k,l,m 

Figure 5: Simple correlation motifs in PPI networks. 

The evolutionary dynamics of these correlation motifs 
can be described in terms of generating functions. 



k I 

y , 



(19) 



fe>0, i>0, m>0 

and rescaled generating functions, 

r(") 



t'-"''{x,y,z) = 



X y , 



^ 2(L(")) 

fc>0, i>0 



E 



fc>0, i>(), m>0 



\^k.l,ml k I n 

— ; — 7-^x y Z 

6(T(")) ^ 



(21) 



(22) 



where (i'"') = 7^'"'(l,l)/2 is the number of links and 
(T'"') = r("'(l, 1, l)/6, the number of triangles. 

Linear recurrence relations similar to ((2)1 and (O can 
be written down for the generating functions H^"\x,y), 
T^"\x,y,z), h^"\x,y) and t^"\x,y,z) (see Supp. Infor- 
mation). These relations capturing all correlations between 
2 or 3 directly interacting partners can also be used to de- 
duce simpler and more familiar network features such as the 
distributions of neighbour average connectivity g(k) [13,14] 
and clustering coefficient C(k) [15,16], defined as, 

^ E.>o(^+i)<^r-'M) ^ di-'h'^r\^)u=o ^ 1 



(23) 

with 4"' (a;) = dyh'^'^\x,y)\y^i, h'-"\x) = /i'"'(x,l) and. 



fc(fc-l)(ivi"') 

6{r'")) dt'd. 



Xjloo^O 



k{k-l){N(")) d^p(^)ix)U=o 
where 4"^(x) = t^"\x, 1, 1) and 6{r("') = 



(24) 



Discussion 

We showed that general duplication-divergence processes 
can lead, in principle, to a broad variety of local and global 
topologies for conserved and non-conserved PPI networks. 
These are generic properties of GDD models, which are 
largely insensitive to intrinsic fluctuations of any micro- 
scopic parameters. 

Non-conserved networks emerge when most nodes disap- 
pear exponentially fast, over evolutionary time scales, and 
with them all traces of network evolution. The network 
topology is not preserved, but instead continuously renewed 
from duplication of the (few) most connected nodes. 

By contrast, conserved networks arise if (and only if) ex- 
tant proteins statistically keep on increasing their connectiv- 
ity once they have emerged from a duplication-divergence 
event. This implies that most proteins and their interac- 
tion partners are conserved throughout the evolution pro- 
cess, thereby ensuring that local topologies of previous PPI 
networks remain typically embedded in subsequent PPI net- 
works. Clearly, conserved, non-dense networks are the sole 
networks of potential biological relevance arising through 
general duplication-divergence processes. Such PPI net- 
works are also shown to be necessary scale-free (that is, 
regardless of other evolutionary advantages or selection 
drives than simple conservation of duplication-derived in- 
teractions) . 
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Supporting Information 



1 Proof of the evolutionary recurrence for the node degree generating function (Eq. 2) 

The generating function for node degrees A''^"' after n duplications is defined as, 

F<--\x) = Y.{Ni-^)x\ (25) 

fe>0 

where {•) corresponds to the ensemble average over all possible trajectories of the evolutionary dynamics. The x'' term of 
F'"' (x) "counts" the statistical number of nodes with exactly k links (one x per link) . 

At each time step n — » n + 1, each node can be either duplicated with probability q, giving rise to two node copies, or 
non-duplicated with probability 1 — q. Hence, in the general case with asymmetric divergence of duplicates (with a more 
conserved, "old" copy and a more divergent, "new" copy), there are 3 -F'"' (Ai(a;)^ contributions to the updated F^"'^^\x) 
coming from each node type, i = s,o,n, for singular nodes, old and new duplicates, 

F^"+^\x) = {l-q)F^"\A,ix))+qF^"\Aoix))+qF^"\A„{x)) (26) 

where the substitutions x Ai(x) in each F^"^ terms {i = s,o,n) should reflect the statistical fate of a particular link "x" 
between a node of type i and a neighbor node which is either singular (s) with probability 1 — g or duplicated (o/n) with 
probability q. In practice, the duplication of a fraction q of (neighbor) nodes first leads to the replacement x {l—q)x + qx'^ 
corresponding to the maximum preservation of links for both singular (a-) and duplicated o/n (a;^) neighbors, and then to 
the subtitution x —> yijx + Sij for each type of neighbor nodes j — s,o,n where yij is the probability to preserve a link 
"a;" (and Sij = 1— 7ij the probability to erase it). Hence, the complete substitution correponding to the GDD model reads 
X {l — q){'yisX + 3is) + qi-yioX + 5io){'yinX + Sin) = ^^(a::) for i = s, o,n, leading to fZ^ . 



2 Statistical properties of the model 

The approach we use to study the evolution of PPI networks under general duplication-divergence processes is based 
on ensemble averages over all evolutionary trajectories. We characterize, in particular, PPI network evolution in terms 
of average number of nodes and links and average degree distribution. Yet, in order for these average features to be 
representative of typical network dynamics, statistical fluctuations around the mean trajectory should not be too large. In 
practice, it means that the relative variance Xgi^) for feature Q'"' should not diverge in the limit n — > oo, 

2(n) f {Q')-{Qf Y"' ^ 

^« ^ (, {Qy ) 

and more generally the pth moment of Q'"^ should not diverge more rapidly than the pth power of the average. If it is not the 
case, successive moments exhibit a whole multifractal spectrum and ensemble averages do not represent typical realizations 
of the evolutionary dynamics. In order to check whether it is or not the case here for general duplication-divergence models, 
we proceed by analyzing the probability distributions for the number of links and nodes. 

The number of link L has a probability distribution P{L) whose generating function Vix) = X/l>o^(^)"^^ satisfies 

p'"+')(x) =p("'[a(a;)], (27) 

a{x) = (1 - qfi-yssX + Sss) + 2g(l - q)('ysoX + 5so){'ysnX + Ssn) + q'^i'yooX + Soo)i'ynnX + 5„„){'yonX + So„f . 

This relation can be justified in a way similar to that of the fundamental evolutionary recurrence above: each node of 
the initial graph will be either duplicated d with probability q or kept singular s with probability 1 — q, leading to three 
possible node combinations for each link: s — s link with probability (1 — q)^, s — d or d — s links with probability 
2q(l — q) and d ~ d link with probability q^ . Then each s — s link is either kept with 7^5 and erased with Sss leading to 
the substitution x — > jssX + Sss in the corresponding term; each s — d or d — s link can lead to two links between s and 
each o/n duplicate, i.e. x {'ysoX + 5ao)(cisnX + 5an), while each d — d link can lead up to 4 links after duplication, i.e. 
X — > {'yooX+5oo){'ynnX + 5nn){'yonX+5onf ■ Combluiug all these operations eventually yields equation (f27)) . 
Successive moments of this distribution are obtained taking successive derivatives of (|27|) . 



4")=9i-p(")(a;)|_^, (28) 

and lead to the following recurrence relations 
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where h{l) = a'(l) = (1 — q)Ts + qTo + qTn and C = a"{l) are constants depending on microscopic parameters. These 
relations can be solved to get the leading order behavior of successive moments 

= ^-[^(1)]"" (^1 + 0{[h{l)]-")^ , (29) 

where Ak are some functions of microscopic parameters. 

The latter relation implies that the fcth moment is equal (modulo some finite constant) to the fcth power of the first 
moment in the leading order when n — > oo. This suggests that in this limit the probability distribution should take a scaling 
form, 



This hypotesis can be verified directly from the explicit form of (|27|) fsee Appendix A for details). 

Although we are not able to determine the scaling function from previous considerations, we can derive some of its 
properties from the successive moments (|28p : in particular for n 3> 1 the link distribution and the function do not present 
a vanishing width around their mean value but instead a finite limit width corresponding to a finite relative variance, 

f {L^) - {LfY"^ 1 ( a"(l) 



This relation is found solving explicitly (|28|) for k = 1 and k — 2 given the initial number of links L^'^K Hence, although 
fluctuations in the number of links are important, they remain of the same order of magnitude as the mean value. This 
result is in fact rather surprising for a model which clearly exhibits a memory of its previous evolutionary states and might, 
in principle, develop diverging fluctuations in the asymptotic limit. 

Fluctuations for the total number of nodes, , and the number of nodes of degree fc > 1, nI"\ can also be evaluated 
using the previous result on link fluctuations and the double inequality < N < 2L, valid for any graph realization. 
Indeed, we obtain the following relations between the pth moments and the pth power of the corresponding first moments, 

< 2''{{L'')'-"'>) oc 2" {L'-''^y = (fc*"y{iV("))f, 

/■7:(") \ p 

using (L^")) = fc^"-'(Ar(")) and (iV^"') = p["^{iV("'), for aU n > 1 and k > I. Hence, we find that fluctuations for both 
A'" and Nk remain flnite in the asymptotic limit for linear asymptotic regimes corresponding to exponential or scale-free 
degree distributions with finite limit values for both mean degree, fc'"' — > A: < oo and degree distribution p^"' ^ > 0, 
for all fc > 1. This corresponds presumably to the most biologically relevant networks. On the other hand, for non-linear 
(scale-free or dense) asymptotic regimes previous arguments do not apply as fc*"^ oo (and p^"' —> for dense regime) 
when n —> oo. The numbers of nodes A'^^"^ and Nj."^ grow exponentially more slowly than the number of links L'"' in 
this case, and the growth process might develop, in principle, diverging fluctuations as compared to their averages, (A''"-') 
and (A^"''), respectively. Yet, numerical simultations (see section 8 below) tend to show that it is actually not the case, 
suggesting that the ensemble average approach we have used to study the GDD model is still valid for non-linear asymptotic 
regimes. 



3 Asymptotic methods 

In this section, we give more details about the asymptotic analysis of node degree distribution deflned by the recurrence 
relation on its normalized generating function p'"'(a;) (O. 

First of all, the series of p("-'(a;) can be shown to converge at each point at least for some initial conditions. Indeed, let 
us introduce a linear operator A4 defined on functions continous on [0, 1] and acting according to 0, i.e., p'""'"^' — TVfp'"'. 
For two non-negative functions f{x) and g{x) so that /(O) = 0, g{0) — 0, /(I) = 1 and g{l) — 1, we have, 

Vx G [0, 1] fix) < g{x) ^-ixe [0, 1] {Mf){x) < {Mg){x). (31) 

It can be verified that ifp^'^\x) = x (one simple link as initial condition), A4p^'^\x) < p^°\x) Va: G [0, 1] and by consequence, 
when applying A4" to this inequality, the following holds 

0<p'"+'>(x) <p'"'(x), VxG [0,1] 
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which means that at each point the series of p'"' (a::) is decreasing and converges to some non-negative value p{x). Futhermore, 
numerical simulations show that for an arbitrary initial condition, there exists an no > 1 sufBsiently large so that p^"\x) 
decreases for n > no. Hence, we can take the limit n ^ cxd on both sides of ^ to get the equation (|10p for the limit 
function p{x). 

We analyze the properties of this generating function p{x) for the limit degree distribution, using asymptotic methods. 
Indeed, we have no mean to solve analytically this functional equation to precisely obtain the corresponding limit degree 
distribution, but we have enough information to deduce its asymptotic behavior at large k, since it is directly related to 
the asymptotic properties of p{x) for x ^ 1. In the following, we note h{a) = (1 — q)T" + gF" + gF", following the same 
notation as in the main text. 

First, we consider the relation between successive derivatives of p{x) at a; = 1 deduced from (0 by taking the corre- 
sponding number of derivatives, ea. (|lip . 

fc 

= aip(i), (32) 

i = [fe/2] 

with some positif coefficients ak,i. The value of A in this relation is still unknown and should be determined self-consistently 
with p(x). Each of these derivatives can also be obtained as a limit of value i9*p(l) = lim„^oo 9^p'"'(l), with the following 
recurrence relation for d^p^"\l) = rn^j^^ 

-r^^ - + f Mfc - + . . . (33) 

directly derived from ((Tjl. Different regimes can be identified depending on the general convex shape of h{oi) {dah{a) > 0). 



h{k) 



Regular regimes - h{a) strictly decreasing for a > iff Ad' — maxi(Fi) < 1, for i = s,o,n. 

In this case, if we suppose that p'(l) is finite, all the derivatives of p{x) at a; = 1 are finite since A = h{l) and h{k) < h{l) 
for Vfc > 2. In fact, the alternative situation — oo and A < h{l) is not possible as it would imply that some first 

moments in (|33}, at least m["'^ and m'^\ would diverge exponentially as (/i(l)/A)". However, since h{k) < h{l) for k > 2, 
this would contradict the fact that the nth moment grows more rapidly than the nth power of the first one. Hence, we 
must have A = h{l) and the solution is not singular at a; = 1 but may have a singularity at some a::o > 1. 
Taking an anzats for the asymptotic expansion in the form 

p{x) = Ao - Ai{xo -x) + A2{xo - xf + Ac^ixo - x)" + O{{xo - a;)"+'). (34) 

and inserting it in pO|l we find that, in order to have the singularity at a; = a;o present on both sides of the equation, xo 
has to be chosen as the root closest to 1 in the following three equations, 

As{x) = X, Ao{x) = X, A„{x) = X, (35) 

where, Ai{x) = {l — q){'yisX + Sis) + q{"/ioX + Sio){'yinX + Sin) for i = s,o, n, or explicitly (since the second root is always 1) 

{I — q)Sss + qSsoSsn {I — q)Sso + qSooSon {1 — q)5sn + qSonSn 



Xo = mm 



q-ysofsn q'yoo'yon q'yo-a^nn 



Since h{a) is strictly decreasing when Fs < 1, Fo < 1 and F„ < 1, it is straightforward to prove that all three values are 
greater than one, and hence, a;o > 1 for regular regimes. 

The value of a is obtained from the same equation Q by comparing the coefficients in front of the singular terms when 
developping each term near x = xo 

In(e^A) 

"^R23ro' 

where i = s, o or n if a:o is the solution of Ai{x) = a;, = (1 — q)^^ , eo = e„ = g^^, and replacing also — » l/2ei or l/3ei 
if two or all three Fi's happen to be equal, respectively. 

We recall that for h{a) under consideration k = p'(l) is finite and A = h(l). Therefore, in this regime the asymptotic 
growth of the graph is exponential with respect to the number of links and the number of nodes with a common growth 
rate A = h{l). We call this asymptotic behavior ^'linear'^ because (L'"') and (A'^'"') are asymptotically proportional. 

The decrease of the limit degree distribution for fc ^ 1 is given by [12] 



PfeOcfc-"-'a:o"'(^l + C(^^jj, (37) 

and is thus exponential with a power law prefactor. When one of the F^'s tends to one, simultaneously a;o ^ 1 and a ^ oo 
and, as we will see below, we meet the singular scale-free regime for the limit mean degree distribution. 
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The emergence of an exponential tail for pk when k ^ 1 naturally comes from the fact that at each duplication step 
the probability for a node to duplicate one of its links (keeping both the original link and its copy), qjoo'yon for o nodes, 
q'Jso^sn for s nodes and qjon'ynn for n nodes, is smaller than the corresponding probabilities to delete the initial link, 
{l — q)5so + qSooSon, {i — q)Sss + qSsoSsn and {l — q)Ssn + qSonSnn (it is in fact equivalent to Xo > !)• For this reason at each 
duplication only few nodes are preserved and they keep only few of their links, the graph contains many small components 
and has no memory about previous states. In a different way, we can develop this argument in terms of a particular node 
degree evolution. When To < 1, Fs < 1 and r„ < 1, nodes o and s as well as their copies n loose links in proportion 
to their connectivities. It means that the number of nodes of a given connectivity is modified by a Poissonian prefactor, 
representing the overall tendency to follow an exponentially decreasing distribution for large number of duplications. 

Singular regimes - h{a) has a minimum on a > iff' M' — maxi(ri) > 1, for i = s,o, n. 

In this case, from p2p we can be sure to have a negative value for some derivative: since h{a) has a unique minimum, there 
exists an integer r > 1 so that h{r) < A < h{r + 1) implying that d!^'^^p{l) < which is impossible by construction. In 
fact, this indicates the presence of an irregular term in the development of p{x) in the vicinity of a; = 1, and for this reason 
the function itself is r times differentiable at this point while its (r + l)th and following derivatives do not exist. Hence, we 
take an anzats for p{x) in the neighborhood of a; = 1 using the following form 

p{x) = 1 - Ai{l -x)+ A2{1 - xf + A^(l - x)" + 0((1 - x)°'+^) (38) 

A priori, we do not know the exact value of A, and it is to be determined self-consistently with p{x). We then substitute 
PSI) into (|10|) to get a "characteristic" equation relating a and A, 

hia) = {l-q)r'^ +qr^ +qrZ=A. (39) 

If we find a nontrivial value of a* > that are solutions of this equation, it will give us an asymptotic expression for the 
coefficients of the generating function of the scale free form 



PkOck-"''^[ l + 0( ^ ) ), fc> 1. (40) 



Note that when the solution takes an integer value a* — r > 1 the form of the asymptotic expansion should differ from (|38|) 
because formally it is not longer singular in this case. In fact, in the anzats a logarithmic prefactor should be added in the 
singular term 

p{x) ^l-A^{l-x)+A2{l-xf + ... + Ar{l-xy + A,.{1 - xY ln(l -x) + 0((1 - xY+^) (41) 

In order for this asymptotic expansion to satisfy equation (|10|) . we should have h{r) = A, as before, as well as an additional 
condition for r — 1 namely h'{l) = 0. 

Note also, that the characteristic equation h{ce) — A can be recovered directly (although less rigorously) using the con- 
nectivity change k — + kFt on average for i-type of nodes (i = s, o, n) at each duplication and the following continuous 
approximation, A^^"' = Efe ^fe"' - J^Ni"Uu, 

^_ {N^^'^) J^{{l-q)Ni'^lr, + qNi'^lT, + qNipJ^)dk _ ((l-g)r^ +grg +gF^) /^(A^V^ _ 
where we assumed that (A^^"') ocfc~°~"'. 

Three cases should now been distinguished depending on the signs of h'{0) and h'{l) (see Fig. 3 in main text): 
1. h'{0) < and h'{l) < 0. 

Since h{a) >h{l) for a < 1, any solution of (|39|l has to be greater than one (as A < h{l)) which implies, by vertue of 
(|40[l . k<oo and consequently A = h{l) exactly (which is consistent with previous considerations). So, for the parameters 
satisfying h'{l) < the value of a we are looking for is the unique solution, a* > 1, of 

h{a*) = (l-g)rf +qrf +gFf = (l-g)F.+gro + qr„ = h{l) (42) 

The other solution a = 1 should be discarded here as it corresponds to a solution only if h'{l) = (see proof for the most 
general dupication-divergence hybrid models, below). 

Evidently, in this regime there exists an entier ko > 1 for which 

h{ko) < h{a) < h{ko + 1), 
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and so all the derivatives of are finite for k > ko while all following derivatives are infinite. Finally, when we fix 

Fi which are less than one and make other Fi ^ 1 the value of a* tends to infinity, the scale free regime (|40p meets the 
exponential one (|37|l . 

2. h'{0) < and h'{l) > 0. 

The condition A < h{l) implies that only solutions with < a < 1 are possible. Therefore, surely = oo in this case 
but there is no additional constraints a posteriori on A which might take, in principle, a whole range of possible values 
between mma{h{a)) and min(/i(0), h{l)). Yet, numerical simulations suggest that there might still be a unique asymptotic 
node growth rate A regardless of initial conditions or evolution trajectories, although convergence is extremely slow (See 
Numerical simulations below) . 

3. h'{0) > (we always have h'{l) > in this case). 

The minimum of h{a) is achieved for ao < in this case, and A < 1 + g < h{l). Yet because solutions of p9p cannot be 
negative by definition of p{x), the only possibility is A = 1 + g, implying that the graph grows at the maximum pace. From 
the point of view of the graph topology, it means that the mean degree distribution is not stationnary and for any fixed k 
the mean fraction of nodes with this connectivity k tends to zero when n ~> oo, the number of links grows too rapidly with 
respect to the number of nodes so that the graph gets more and more dense. For this reason, we refer at this regime as the 
dense one. 



4 Whole genome duplication-divergence model {q = 1) 

The case q = 1 describes the situation for which the entire genome is duplicated at each time step, corresponding to the 
evolution of PPI networks through whole genome duplications, as discussed in ref. [3]. All results obtained above in the 
general case remain valid although there are now no more "singular" genes (s) and thus no "fij's involving them. We just 
summarize these results here adopting the notations of ref. [3] for the only 3 relevant ytj's left; 70 = 700, 7n = 7nn and 
7 = 7o„, hence 

Fo = 7o + 7, F„ = 7„ + 7, (43) 
The model analysis then yields three difi'erent regimes (we do not consider the case Fo + F„ < 1 for which graphs vanish) 



1. Exponential regime Fo+F,i > 1, max(Fo,F„) < 1. The limit degree distribution is nontrivial and decreases like (|37|l 
with 

[ (3n<3/7'i7i Fo > F„ 

and 

ln(F<,+F„) 

a^—, ^ — -V, Fo/F„ (45) 

ln(2-max(F„,F„)) 

while 

«=^^i|^, forF„ = F„. (46) 
The rate of graph growth in number of nodes as well as in number of links is A = Fo+F„ 

2. Scale free regime (Fo > 1, F,j < 1) or (Fo < 1, F„ > 1). The limit degree distribution is surely nontrivial for 

/i'(l) = FolnFo + F„lnF„ < 0, (47) 

and described by an asymptotic formula (|40p with a* > 1 solution of 

F?+F^=Fo+F„, (48) 

In this case the ratio of two consecutive sizes is also A = Fo + F„. When 

/i'(l) =FolnFo + F„lnF„ >0, FoF„ < 1 (/i'(0) < 0), 

the mean degree distribution is still expected to converge to a nontrivial asymptotically scale-free distribution with 
< Q < 1. 

3. Dense regime FoF„ > 1 (i.e. h'{0) > 0). The mean degree distribution is not stationary: the growing graphs get 
more and more dense in the sense that the fraction of nodes with an arbitrary fixed connectivity tends to zero when n — > 00. 
Almost all new nodes are kept in the duplicated graph A = 1 + q. 
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Because all these regimes are defined in terms of two independent parameters (instead of three), the model phase 
diagram can be drawn in a plane (ro,rn), or equivalently in (ro+r„,ro) (See Fig. 4B). This last representation is adapted 
to show explicitly the domains of node conservation and graph growth, while the alternative choice (Fo + Fn, To — r„) used 
in ref. [3] is best suited to illustrate the asymmetric divergence requirement to obtained scale-free networks (see [3] for a 
detailed discussion). 



5 Local duplication-divergence limit {q 0) 

A diff'erent limit model is obtained for q going to zero when the mean size of the graph tends to infinity. In principle, the 
most general model of this kind is the one defined by a monotonous decreasing function q{{N)) with 

lim q{x) = 0. 

X — >oo 

For any function of this type, the graph growth rate in terms of links depends essentially on because 



(L(")) 



= (1 - q)rs + gFo + qr„ = 7^, + 2q{jso + 7sn - 7==) + C(9 j, 



and if jsa < 1 the ensemble average of graphs will never reach infinite size, it will have at most some finite dynamics. So, 
we will suppose that 7^5 = 1, to ensure cin infinite growth. \Ve remark clIso tlicit 'joo^ 'yon and 7„„ appear only in the term 
of order q^ in the last expression because two new nodes have to be kept in order to add any link of the type 00, on or nn. 

When q becomes small an approximate recursion relation for generating functions can be obtained by developping ^ 
(we set ^ss ~ 1) with 



p^"\A4x)) = p'-"\x) + q{{Sso + 7.ox)(5.„ + jsnx) - x)dxp^''\x) + 0{q^) 
p^-^\Ao{x))=p'-'^\5so+lsox) + 0{q) 
p'")(4„(s)) =p(")(5.„ +7.nx) +0(g) 

(49) 

gives in linear order of q 



+ 0{q^ 



,„ , , (1 - q)p^"\x) + q({Ssc + 'ysoX){Ss„ + Js,^x) - x)dxp'-"'''{x) + qp'^"^(Sso +JsoX) + qp''"'>{Ss„ +'ysnx) 
with 

^ (50) 



a'") = 1 ^ q(^SsoSsndxP^"\0) + P^"\5so) + P^"\5su) - 1^, 



an expression which does only depend on 3 of the 6 general 7ij's: 7^3, Jbo and 7sn. By neglecting terms in q^ we obtain 
a model for which duplicated nodes are completely decorrelated in the sense that the probability for an o or s node to 
have two new neighbours is zero, and consequently any two new nodes do not have common neighbors. This model can be 
regarded as a generalization of the local duplication model proposed in [4] for which only one node is duplicated per time 
step and without modification of the connectivities between any other existing nodes, i.e. 'jss ~ 1 and 7^0 = 1. Indeed, 
when taking for q a decreasing law 

on average A nodes per step are duplicated. By setting 730 = 1 in (|50|l we first get the following form for the recurrence 
relation, 

~{n+l). ^ p'-"\x)+q^snX{x-l)d^p^"'>{x)+qp'-"\5sn+'ysnX) („) 

P' '{x) = , A'^ ' =l~qp' '(dsn), (51) 

and then using the definitions of A'"' and p^"\x) © to reexpress it as, 

^(„+i) ^ ^(n) ^ ^^^^ _ _ A7,„fcp("' +aY^ C7,Si5:-^pi") , (52) 

s>k 

This expression is identical to the basic recurrence relation in the model of ref. [4] for A = 1. For an arbitrary A the 
asymptotic properties of the growing graph are essentially the same as in ref. [4], with only the growth rate modified by a 
factor proportional to A. 
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In the more general cases for which both 7s„ and 7so may vary (with 7^3 = 1 remaining fix to ensure a non-vanishing 
graph), an asymptotic analysis can be carried out for the limit degree distribution with an asymptotic solution of the form 

Pix) = ~Ai{l ~x)+ A2{1 - xf + A^{1 - x)"- + 0{{1 - x)"+') 
satisfying (|50p with q oc A/{N^"'''). The characteristic equation thus becomes, 

= ■yfo + ifn + ailso + 7sn - 1) - 1 = V', 

where ip is defined as 

f A^") - 1) 

lim A — L ^ A^"' ~ 1 + g'"V, ^ ^ oo- 

while the graph growth rate in terms of number of links is given by, 

(1 - q)Ts + qVo + gr„ = 1 + q{2-iso + i-ysn - 2) + 0{q^), 

at first order in q. Since the number of nodes can not grow more rapidly than the number of links, we can conclude that 
(fi < 2"/so + 'i.'yan — 2, in addition to, (p < 1, correponding to the maximum growth rate. Focussing the analysis on the case 
7so + 7sn > 1 for which the graph does not vanish, one finds that the "characteristic" function hi (a) is always convex, and 
the following results are obtained as in the asymptotic analysis of Sec. 3 in Supp. Information: 

• When /i;(0) < and < the characteristic equation has a solution, a* > 1, and the limit degree distribution is 
asymptotically scale-free pk oc fe"" ~^ with a* varying on the interval [1, 00) (depending on parameters 7so and 7sn) 
while ip = /i;(l). 

• For h'lil) — precisely, the singular term of the asymptotic solution becomes (1 — x) ln(l — x) and the limit degree 
distribution decreases as ps, oc fc~^, for A; ^ 1. 

• When h'i(0) < and /ij(l) > 0, scale-free regimes with slowly decreasing degree distributions are expected in general 
with ip < min(27i,o + 7s„ — 2, 1) and the corresponding < a < 1. 

• For h'[{0) > the mean degree distribution is not stationary, ip = 1. 

Fig. 4 summarizes these results for the limit degree distribution. More generally for 

q{{N)) = j^, A>0, /3>0, 

when /3 > 1, nodes a rarely duplicated so that the interval between two succesfuU duplications in number of steps is 
approximately 

noc {iV'"y"\ 

Therefore /3 > 1 gives a model equivalent to /3 = 1 with a change of time scale. On the other hand, for < /3 < 1 a set of 
nontrivial models is obtained. 



6 General duplication-divergence hybrid models 

We start the analysis of GDD hybrid models with the case of two duplication-divergence steps involving some fractions 
qi and q2 of duplicated genes, introducing explicit dependencies in q and x for Ai{q,x) and Ti{q) = dxAi(q, 1) functions 
{i = s,o, n), At{q,x) = {l-q){'ytsX + Sis) + q{fioX + Sio)['yinX + 5i„) and F^g) = (l-g)7is -I- q{-fio + 7™). 

An evolutionary recurrence for the hybrid generating function can be found by introducing the intermediate step 
explicitly, ^ f ^ where, 

{l-q^)p^-\As{qi,x))+qip<-"\A,{q^,x))+q,p^-\Ar.iq^,x)) 
' (x) — ^ ^- ^ 

At") = -{l-q,)p<--\As{qi,0)) - qi p("'(A„(gi , 0)) - gi p("'(A„(gi, 0)) > 0, 
and then f '"^ ^ p^"'+^'> with, 

il-q2)r("\Asiq2,x))+q2 r("\Ao{q2, x)) + q2 r^^\A^iq2 , x)) 

P = 7m 

^2 

A^"^ = -(l-g2)f("\4.(g2,0)) -ga f ("'(^„(g2, 0)) - g2 f("'(A„(g2, 0)) > 0, 
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• p("+i) step, 



which finally yields for the efi'ective p'"' 



~(n + l) / \ 



^(n)^{n) 



?2 



+ - 



(n) A (n) 



92 



+- 



(n) • (n) 



Expressing successive derivatives at 2: = 1, d^p{l), for fc > 2 in the asymptotic limit p^"\x) p{x) and A'" A 
for n^<x, yields, d^p{l) = (l-g2)(l-gi)r,*(?2)r^(gi)9^p(l) + (l-g2)gir^(g2)rS(gi)a*p(l) + • • • and hence, 

- ((1 - 9i)rS(gi) + girS(gi) + gir*(gi)) ((1 - gajr.^fe) + g2rS(g2) + g2r,';(g2) 



A2 



i = [fc/2] 



(53) 



In fact, this simple duplication-divergence combination can be generalized to any duplication-divergence hybrid models 
with arbitrary series of the 1-1-6 microscopic parameters 7'"'}^ G [Oilli for i,j = s,o,n and 1 < n < 7?. Each 

duplication-divergence step then corresponds to a different linear operator A^^"' defined by g^"' and the functional arguments 
A^\q^-\x) = (l-g("))(^W:r + 4")) -f g(")(7L"):c + 4">)(7<:':r + 5(:>) and r*"' = d^Al"\q^-\l) for i = s,o,n (with 
j4^"^(g''"', 1) — 1). Hence, applying the same reasoning as in Asymptotic methods to the series of linear operators {Al*-"'}^ 
implies that any duplication-divergence hybrid model converges in the asymptotic limit (at least for simple initial conditions). 

In the following, we first assume that the evolutionary dynamics remains cyclic with a finite period R, before discussing 
at the end the R—>cx limit, which can ultimately include intrinsic stochastic fluctuations of the microscopic parameters. 

In the cyclic case with a finite period R, successive derivatives at a; = 1, d^p{l), can be expressed in the asymptotic 
limit, p^"'\x) — > p{x) as. 



9'p(l) 1 



(1 - g("))ri">* + g(")ri"'* + g(")r(r^'' 



A« 



= ^Qfe.i dip{l), 

I=[fe/21 



(54) 



Network conservation for such general duplication-divergence hybrid model corresponds to the condition M > 1, where 
the conservation index M now reads 



M = 



i-9^"')r<"' 4-g'"'ri"' 



1/R 



while the nature of the asymptotic degree distribution is controlled by 



1/R 



(55) 



(56) 



with M' < 1 corresponding to exponential networks and M' > 1 to scale-free (or dense) networks with an effective node 
degree exponent a and effective node growth rate A that are self-consistent solutions of the generalized characteristic 
equation, 

1/R 



= A 



(57) 



The resolution of this generalized characteristic equation is done following exactly the same discussion for singular regimes 
as with constant g and Vi (Fig. 3 and main text) due to the convexity of the generalized h{a) function, d^h{a) > 0. 
Indeed, the first two derivatives of h{a) yield (with implicit dependency in successive duplication-divergence steps, 



Fi = r'"' , etc, for i = s,o,n) 



dah(a,q) 



'rJhf y^""^ ^ (i-g)r?inr. + grginr„ + grginr„ 

|J_ft^a,gjl (1 - g)rf + gr? + gr- 
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dlh{a) = 



1 A (i-g)gr?rg(inr, -inTo) +(i-g)gr?n(inr, -inr„) +g^rgr°(inro-inr„ 

R ^ h^{a,q) 

1/R 



dah{a, q) 
R ^-^ h{a,q) 



-y 



Let us now show that the solution of the generahzed characteristic equation corresponding to a = 1 implies = 0, 

which is an essential condition to prove the existence of scale-free asymptotic regimes with a unique power law exponent, 
•pk oc k~" with a* > 1 (see main text). 

The generalized functional equation defining the limit degree distribution for a GDD hybrid model with an arbitrary 
sequence of duplications contains a sum over 3^ terms with R times nested functional arguments, 

J " » ■' 

R times 

with all possible ij — s,o,n for I < j < R, and a prefactor c/ for I = {ii, . . . equal to a product of (1 — q'"''') or 
q(j) corresponding to each occurence of As"*' • • •) or Ai{n{q^''\ ■ ■ •), respectively, within the nested functional argument. 
Inserting the expansion anzats for a = 1 near x = 1, 

p{x) — — ai(l — .t) — a (1 - x) ln(l - x) + 0((1 - xf ln(l - x)) 

in the general functional equation yields the following form for each of the 3^ terms p{Ai(x)) of the functional sum (where 
Ai{x) is the nested functional argument), 

p{Ai{x)) -ai{l-Ai{x))~a'(l~Ai{x))ln{l~Ai{x)) = 

-aiA'i{l){l -x)- a'A'iil) InA'iWil - x) - a'A'iil){l - x) ln(l - x) + 0{{1 - xfhi{l - x)). 

where, 

R. R 

X{l)=l[d^A'f:\q<-"\l)=l[r\:\ and ^ cM^l) = Wl)]^ 

n n I 

Hence, after collecting all 3^ terms together we get for the functional equation, 

p(^) = -x)^ Y. ^^-^^(1) - "1 (^) ""(1 - ^) - «' (^) (1 - ^) Ml - 

As the solution a = 1 implies A = h{l), the last two terms on the right side of the functional equation correspond exactly 
to the expansion anzats of p{x) near x = 1 for a = 1, implying that the first term must vanish (with a' 7^ 0). This imposes 
the supplementary condition, 

^cMj(l)lnyi^(l) = 
I 

which is in fact equivalent to h' (1) = 0. 

Finally, let us discuss the case of infinite, non-cyclic series of duplication-divergence events, which can include intrinsic 
stochastic fiuctuations of all microscopic parameters. Formally, analyzing non-cyclic, instead of cyclic, infinite duplication- 
divergence series implies to exchange the orders for taking the two limits p'"'(j;) p{x) and R 00 (with 1 < n < R). 
Although this cannot be done directly with the present approach, either double limit order should be equivalent, when there 
is a unique asymptotic form independent from the initial conditions (and convergence path). We know from the previous 
analysis that it is indeed the case for the linear evolutionary regimes (with /i(l) = A) leading to exponential or scale-free 
asymptotic distributions (with a unique a* > 1). Hence, the main conclusions for biologically relevant regimes of the GDD 
model are insensitive to stochastic fluctuations of microscopic parameters. 

On the other hand, when the asymptotic limit is not unique, as might be the case for non-Unear evolutionary regimes, 
the order for taking the double limit p^"\x) — > p{x) and J? — + 00 (with 1 < n < R) might actually affect the asymptotic 
limit itself. Still, asymptotic convergence remains granted in both limit order cases (see above) and we do not expect that 
the general scale-free form of the asymptotic degree distribution radically changes. Moreover, numerical simulations seem 
in fact to indicate the existence of a unique limit form (at least in some non-Unear evolutionary regimes) but after extremely 
slow convergence, see Numerical simulations below. Yet, the unicity of the asymptotic form of the GDD model for general 
non-linear evolutionary regimes remains an open question. 
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7 Non-local properties of GDD Models 

The approach, based on generating functions we have developped to study the evolution of the mean degree distribution 
can also be applied to study the evolution of simple non-local motifs in the networks. Here, we consider two types of motifs: 
the two-node motif, A*'^"' (Fig- 5B), that contains information about the correlations of connectivities between nearest 

neighbors, and the three-node motif, T^"'^ (Fig- 5C), describing connectivity correlations within a triangular motif. Two 
generating functions can be defined for the average numbers of each one of these simple motifs, 

H^-\x,y) = (58) 

fe>0, i>0 

T^-\x,y,z) = J2 <^m!™)^V^'"- (59) 

fe>0, !>0, m>0 

By construction these functions are symmetric with respect to circular permutations of their arguments. 

By definition and symmetry properties of these generating functions, one obtains the mean number of links (L*-"^) or 
triangles (T'"'), by setting all arguments to one, 

//(")(:r = l,y = l) = 2{i(")), 

T<") {x = l,y = l,z = l) = 6{r'")), 

Hence, we can appropriately normalize these generating functions as, 



'^'"^(-'^) = E ^-V, (60) 

fe>0, I>0 



>{x,y,z) = 2^ gmn)N ^ (61) 

fe>0, i>0, m>0 

which yields two rescaled generating functions, varying from zero to one, for the two motif distributions. 

Linear recurrence relations can then be written for these generating functions H''"\ t'"', ft'"' and f'"', using the same 
approach as for the evolutionary recurrence (01 (see Appendix B for details). These relations which contain all information 
on 2- and 3-node motif correlations, can also be used to deduce simpler and more familiar quantities, such as the average 
connectivity of neighbors [13,14], g{k), and the clustering coefhcient [15,16], C{k). 

g{k) is defined on a particular network realization as, 

where d; denotes the connectivity of node i. This can be expressed in terms of the two-node motif of Fig. 5B and averaged 
over all trajectories of the stochastic network evolution after n duplications as, 

"'^{-^^' / = ^) • 

where the average of ratios can be replaced, in the asymptotic limit n oo, by the ratio of averages for linear growth 
regimes, for which fiuctuations of A^^"^ do not diverge (see section on Statistical properties of GDD models). Note, however, 
that this requires A'^^."' ^ 1 which excludes by definition the few most connected nodes (or "hubs", k > kh) for which 
(A^^"') < 1 (See section on Numerical simulations, below). 

With this asymptotic approximation (fc < kh), g''"\k) can then be expressed in terms of h''"\x,y) and its derivatives, 

,'")(fe)^^-'^f;(-)'-vi (63) 

Ox /i,) '(a;) 1^=0 

where h^"\x) = dyh'-"^ {x,y)\y=i, and h[^\x) — h'''^\x,y = 1). Hence, we can reduce the recurrence relation on h'^^^[x,y) 
to two recurrence relations on single variable functions ftj"'(a;) and li'^\x) with, 

h'-\x)^(k^"X'd^p^"\x) 



using the mean distribution function defined in ©. 



16 



By construction g{k) reflects correlations between connectivities of neighbor nodes and can actually be related to the 
conditional probability p{k'\k) to find a node of connectivity k' as a nearest neighbor of a node with connectivity k 



k' 

It is important to stress that g^"\k) defined in this way might be non-stationary even though a stationary degree 
distribution may exists. Indeed, by definition satisfies the following normalization condition, 

= ^ fc^pt"' = (64) 

k k 

which implies that g^^\k) should diverge whenever fc^^"' does so (and A;'"' < oo). This is in particular the case for 
actual PPI networks with scale- free degree distribution pj; oc with 2 < a -I- 1 < 3. When comparing actual PPI 

network data with GDD models (as discussed in ref. [3]), we have found that such divergence can be appropriately rescaled 

by the factor fc'^'/fc^* which yields quasi-stationary rescaled distributions fc''"'*;'-"' (fc)/^^' ' (see Numerical Simulations). 



The clustering coefficient, C{k), is traditionally defined as the ratio between the mean number of triangles passing by 
a node of connectivity k and k(k — l)/2, the maximum possible number of triangles around this node. When replacing the 
mean of ratios by the ratio of means in the asymptotic limit, as above, we can express C''"-'(fe) as, 



S;>0, m>o("^(fc-2,!,m)) 

fc(fc-l){iV("') 



Hence, this distribution is entirely determined by the following two generating functions ^'■"''(a;) and t'"^\x) — t^^\x, 1, 1) 

(k) = g^'4"'(^)U=o .g6^ 

fc(fc-l){iVW) dipi-){x)U=o ' ^ ' 

where 6{r'"') = t'l^\l). A self-consistent recurrence relation on tl^\x) can be deduce from the general recurrence relation 
on T'"' (x, J/, z). We postpone the detailed analysis of these quantities to futur publications. 



8 Numerical simulations 

We present in this section some numerical results which illustrate the main predicted regimes of the GDD model. The most 
direct way to study numerically PPI network evolution according to the GDD model is to simulate the local evolutionary 
rules on a graph defined, for example, as a collection of links. This kind of simulation gives access to all observables associated 
with the graph, while requiring a memory space and a number of operations per duplication step roughly proportional to 
the number of links. On the other hand, if we are interessed in node degree distribution only, a simpler and faster numerical 
approach can be used: instead of detailing the set of links explicitly, one can solely monitor the information concerning 
the collection of connectivities of the graph, ignoring correlations between connected nodes. At each duplication-divergence 
step, a fraction q of nodes from the current node degree distribution is duplicated and yields two duplicate copies ("old" 
and "new" ) while the complementary 1 — q fraction remains "singular" . Duplication-derived interactions are then deleted 
assuming a random distribution of old/new vs singular neighbor nodes with probability g vs 1 — g. The evolution of the 
connectivity distribution derived in this way corresponds exactly to the evolution of the average degree distribution; even 
though particular realizations are different, we obtain on average the correct mean degree distribution. This simulation only 
requires a memory space proportional to the maximum connectivity and a number of operation that is still proportional 
to the number of links. Since the number of links grows exponentially more rapidly than the maximum connectivity, this 
numerical approach provides an efficient alternative to perform large numbers of duplications as compared with direct 
simulations. The numerical results presented below are obtained using either approach and correspond only to a few 
parameter choices of the GDD model in the whole genome duplication-divergence limit (q = 1). These examples capture, 
however, the main features of every network evolution regimes. 



From scale-free to dense regimes 

We first present results for the most asymmetric whole genome duplication-divergence model [3] g = 1, 7oo = 1 and 7„„ — 
for four values of the only remaining variable parameter 7on — j = 0.1, 0.26, 0.5 and 0.7, Figs. SIA&B. As summarized on 
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the general phase diagram for q = 1, Fig. 4B, this model does not present any exponential regime, but a scale- free limit 
degree distribution pk~k~°' with a unique a* satisfying 

(1 + 7)"* +7"* = 1 + 27 

for 7 < 0.318, and a non- stationary dense regime for 7 > (a/S — l)/2 ~ 0.618, while the intermediate range 0.318 < 7 < 0.618 
corresponds to stationary scale-free degree distributions in the non-linear asymptotic regime (i.e. (l+7)"+7" = A < I-I-27) 
which we would like to investigate numerically in order to determine whether or not it corresponds to a unique pair (a. A), 
see discussion in Asymptotic methods. 

As can be seen in Fig. SIA, for 7 = 0.1 the degree distribution becomes almost stationary with the predicted power law 
exponent {a* -I- 1 ~ 2.75) for more than a decade in k and typical PPI network sizes (about 10'* nodes). Besides, this small 
value 7 = 0.1 appears to be within the most biologically relevant range of GDD parameters to fit the available PPI network 
data (including also indirect interactions within protein complexes), when protein domain shuffling events are taken into 
account, in addition to successive duplication-divergence processes, as discussed in ref. [3]. On the other hand, numerical 
node degree distributions are still quite far from convergence for 7 = 0.26 and even more so in the non-linear regime with 
7 — 0.5, even for very large PPI network sizes > 10^ connected nodes. 

Simulation results for the distributions of average connectivity of first neighbor proteins g{k) [13, 14] are also shown in 
Fig. SIA. g{k) is in fact normalized as g{k) ■ k/k'^ to rescale its main divergence [3]. A slow decrease of g{k) is followed 
by an abrupt fall at a threshold connectivity kh beyond which nodes (with k > kh) are rare and can be seen as "hubs" in 
individual graphs of size A'' {k^ corresponds to N-p^^ ~ 1). Degree distributions for large k > kh are governed by a "hub" 
statistics which is different, in general, from the predicted asymptotic statistics (although this is not so visible from the 
node degree distribution curves). 

Fig. SIB shows the evolution of the node degree distribution for the same most asymmetric whole genome duplication- 
divergence model with 7 = 0.7, corresponding to the predicted non-stationnary dense regime. As can be seen, the numerical 
curves obtained for different graph sizes are clearly non-stationary in the regions of small and large k, with local slopes 
varying considerably with the number of duplications (and mean size). This was obtained using the efficient numerical 
approach ignoring connectivity correlations (see above), which cannot, however, be used to study the average connectivity 
of first neighbor proteins g(k) (direct simulations can be performed though up to about TV = 10* nodes, as shown in 
Fig. SIB). 




Node degree (k) Node degree (k) 

Figure SI; Simulation results in the whole genome duplication-divergence limit with largest divergence asymmetry. 

A. Distribution pj, and g{k) obtained for 7 = 0.1 with n = 50 (magenta, N = 7 X IQ-^, L = 9 X 10'*) and n = 60 (red, N = i X 10**, 
L = 5.3 X 10*); for 7 = 0.26 with n = 25 (cyan, Af = 1.7 X 10*, L = 3.5 X 10*) and n = 30 (blue, Af = 1.3 x lO'' L = 2.9 X lO^^); for 
7 = 0.5 with n = 16 (light green, AT = 1.3 X 10*, L = 6.4 X 10*) and n = 18 (green, Af = 4.6 X 10*, L = 2.7 X 10^); average curves are 
obtained for 1000 iterations. B. Distribution pj. obtained for 7 = 0.7 with n = 12 (black, A^ = 4 X 10"^, L = 3.6 X 10*, g{k) is also 
shown in this case), n = 16 (red, AT = 6 X 10*, L = 1.2 X 10^) and n = 20 (green, Af = 9 X 10^, L = 3.9 X lO''). Distributions are 
averaged over 2000 iterations. 

Finally, we have studied numerically the convergence of the GDD model for these four parameter regimes, 7 = 0.1, 
0.26, 0.5 and 0.7. The results are presented in terms of A'"' (Fig. S2A) and its distribution (Fig. S2B) as well as through 



18 



the node variance Xiv'"' = {{N^"^^) - (iV'"') (Fig- S2C). Fig. S2A confirms that the convergence is essentially 

achieved for 7 = 0.1 while 7 = 0.26, 7 — 0.7 and above all 7 = 0.5 are much further away from their asymptotic limits. 
For instance, we have A^"' ~ 1.86 for 7 = 0.5 when {N^"'>} ~ lO'' nodes, while we know from the main asymptotic 
analysis detailed earlier that 1.9318 < A < 2 in the corresponding asymptotic limit. Yet, it is interesting to observed that 
these numerical simultations suggest that the asymptotic form for the non-linear regime 7 = 0.5 might still be unique, as 
convergence appears to be fairly insensitive to topological details of the initial graphs (Fig. S2A) and stochastic dispersions 
of the evolutionary trajectories: distributions of A'"' become even more narrow with successive duplications (Fig. S2B), 
while the dispersion in network size given by Xiv'"' is typically smaller for non-linear than linear regimes with a very slow 
increase for large network size {N^"^) > 10, 000 nodes (Fig. S2C). Still, a formal proof of such a unique asymptotic form (if 
correct) remains to be established, in general, for non-linear asymptotic regimes of the GDD model. 
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Figure S2: Asymptotic convergence for the whole genome duplication-divergence limit with largest divergence asym- 
metry. A. Asymptotic convergence of A^"' from a simple initial link (black), triangle (green) or 6-clique (red) for the GDD model 
with 9 = 1, 7oo = 1, 7nn = and four values of 7on = 7 = 0.1, 0.26, 0.5 and 0.7. The corresponding asymptotic limits, A = 1.2, 1.52, 
[1.9318;2] and 2, as well as the linear to non-linear regime threshold are shown on the right hand side of the plot. B. Distribution 
of A(") for successive duplications from different initial network topologies in the non-linear regime with 7 = 0.5. C. Node variance 

Xjv(") = ((Af(")^) - (Ar("))^)^''V(^'"') for the GDD model with q = l, ^00 = 1, 7nn = and four values of 7on = 7 = 0.1, 0.26, 0.5 
and 0.7 and starting from a simple link (2-clique). 



From exponential to dense regimes 

An example of GDD model exhibiting an exponential asymptotic degree distribution can be illustrated with a perfectly 
symmetric whole duplication-divergence model 5=1, 700 = 7on = 7nn = 7 < 0.5. The corresponding Fig. S3A shows 
a good agreement between theoretical prediction and the quasi exponential distribution obtained from simulations with 
7 = 0.4 < 0.5 (as 7 > 0.5 correspond to non-stationary dense regimes, see below). 

Finally, the same symmetric whole genome duplication-divergence model exhibits also a peculiar property due to the 
explicit form of its recurrence relation 

p("+^)(x) =p("'((7x- + 5)2) 

which happens to be precisely of the class of the link probability distribution Ea. (|67|l studied in Appendix A. Hence, in 
the limit of large n the corresponding degree distribution should have a scaling form as defined by Ea. (|68|) . Indeed, the 

simulation results depicted in Fig. S3B show that the scaling functions A:'"'j)^"'' = w(fc/fc "') plotted for different graph 
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sizes are perfectly close in the asymptotic limit, although the overall evolutionary dynamics is in the non-stationary dense 
regime, here, with 7 = 0.6 > 0.5 (i.e. k 
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Figure S3: Simulation results in the whole genome duplication-divergence limit with symmetric gene divergence. 

A. Distribution pj, obtained for 7 = 0.4 witli n = 15 (black, = 1.2 X 10^, L = 1.1 X 10^) and n = 20 (blue, = 1.2 X 10**, 
L = 1.2 X lO'l); B. Sealing function «i(fe/A;'"') (see text) obtained for 7 = 0.6 with n = 10 (black, Af = 1.2 X 10^, L = 6.3 X 10^), 

n = 12 (blue, AT = 4.7 X 10^, L = 3.7 X 10"*) and n = 13 (magenta, Af = 9.3 X 10^ , L = 8.8 X lO**) ; ui^fc/fc'"') is shown in both log-log 
and log-lin (inset) representations; average curves are obtained for 1000 iterations. 



Appendices 

A Scaling for Probability Distributions 

Let p^"' be a probability distribution whose generating function P'-"\x) = X/fePfe"^^*^ satisfies the following recurrence 
relation 

P'"+i)(a:) =p("'[a(a:)], (67) 

with a{x) a polynome with positive coefficients of degree m > 1 with a(l) = 1 and a'(l) > 1. This probability distribution 
can be shown to exhibit a scaling property 

p["^ ^[a{l)\-F{k/[a'{ir), n»l. (68) 

Indeed, we first remark that any polynome of this kind can be decomposed as 

mi 7712 

o-i^) = Y\.^^' ~^ '^'^'^ JI(%(^ + Cj)^ + bj), mi +m2 = m, 

where the first product collects the real roots of the polynome while the second product corresponds to all pairs of complex 
conjugate roots. Since all coefficients are positive, 7^, Si, Uj, bj and Cj are also positive. In addition, we can choose "/i+5i = 1 
and aj(l -I- Cj)^ + bj = 1 for all i and j. 

Then, the recurrence relation (|67p is equivalent to 

k 2hi 



k = ls/m] li=0 lmi=0 hl=Osi=0 hm2=0sm2=0 



hi J \ Si J V hm.. 
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where D„ = nmDo is the degree of P^"'>{x). In the following, we fix 71 2> 1 and suppose that the first moment is large 
A = [a'(l)]" S> 1, so that we can rescale all the variables as 

X = s/A, y = k/A, yi = k/A, wj = hj/A, Zj = Sj/A 

and finally replace the sums by integrals over rescaled variables. We choose also n to be sufficently large to have Dn/A ^ 1. 
We then apply Stirling formula to get a continuous approximation for binomial coefficients and use the expected scaling 
form of p^"' from (|68p . so that, when replacing sums by integrals in the continious approximation, we obtain, 



/■oo I'y py py p2wi py p2w^^ 

^ ^-n^mi/2+m2-i / dyF{y) / ••• / dyi . . . dymi / dwi / dzi . . . dWm^ / dZm2 

Jx/m Jo Jo Jo Jo Jo Jo 



with 



f{y, {Vi}, {wj}, {zj}) = ^iylny - (y ~ yi) \n{y - yt) - yt Inyi + yi In^i + (y - yi) In 5; j + 

i ^ ^ 

^ ^ I ?/ In J/ — (?/ — Wj) ln(y — Wj) — WiXwWj + Wi In aj + (y — Wj) In bj + 

i 

^2wi In 2wj — (2wj — Zj) ln[2wj — Zj) — Zj In Zj + {2wj — Zj) In Cj 

mi , ^ ^ 1/2 ma ^ X 1/2 

G{y,z,,...,z^) = ^ly,^—^—-^^ n((2,r)2,^.(j,-J,)(2™.-^.)j ' 

Since A is large, we can apply the Laplace method first to the mi + 2m2 internal integrals. We have to minimize / with 
respect to yi, Wj and zj given that '^^yt + zj = x. This can be performed by the Lagrange multiplier method by 
looking for the minimum of 

fiy, {yi}, {wj}, {zj}) - yi + ^ - ^) 



and 



and setting ^ . yi + Zj — x for the solution. 
In this way we obtain a unique minimum at 



,0 _ y „„o _ y_ ^0 ^ 2y 



yi ! ^j L ' 



with 

ffli = 1 + — e , gj = 1 + Cj-e , ftj = 1 + 



and A is determined implicitly as a function of x and y from the normalization condition 

y — + 2y — !— = x. 
Z-^ a, ^ hjgj 

After some algebra, we find that the values of / in the minimum is given by 

w{y, x) = f{y, {y^}, {w^}, {z°}) = (-(1 - «r') ln(l - a"^) - a'^ Ina"^ + a'^ In 7, + (1 - a"^) ln5, j + 

y ^ ( -(1 - ln(l - - hj^ InhJ^ + hj^ Inoj + (1 - hj^)lnbj + 

-2h-' [-{1 - g-') ln(l - g;') - In + (1 - gj') Inc,] ^ 
Therefore we write the leading contribution from the mi + 2m2 internal integrals in Ea. (|70p as. 
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with g{y,x) collecting all the contributions of the integrals, while the power of A can just be determined by the number of 
integrations left after integrating the delta function. 
The last integral to calculate in (|70|) is on y 

J x/m 

where we have collected all slow varying terms and constants in H(y,x). When applying the Laplace method we calculate 
the derivative of w{y,x) with respect to y that turns out to have a simple expression 

9,«;(y^:r) = ^ln(^)+^ln(^)=^ln(^,+7.e-^) + ^ln(a,(e-Vc,)^ + 6,)=0. 

i j i j 

The last condition is equivalent to ^^(iJi + 7ie~^) (cij (e^^ + Cj)^ + bj) = 1 which has a unique solution A = 0, and for 
the saddle point we get i/° = x/(^^ ■ji + aj{l + Cj)) — x/a {1). 

Now it is just a matter of tedious calculations to prove that the prefactor shrinks to l/a'(l) so that 

= [a'(l)] — ^F(^fc/[a'(l)]"+i^ , [a'(l)]"+i = A ■ a'{l), 

as anticipated from the scaling expression Ea. (|68|l . We were not able to determine the exact shape of the scaling function 
F which is strongly dependent on the initial probability distribution (an example is shown in Fig. S3B). 



B Recurrence relations on and T^") 

In order to relate _ff and we remark that by partial duplication process one motif (fc, I) of type Fig. 5B can 

generate up to three new motifs of this kind. If the middle link of this motif links two s nodes (probability (1 — q)^), the 
motif itself is kept with the probability 7^3 and its external connectivities are modified in the same way as the connectivities 
in the fundamental evolutionnary recurrence, i.e., 

xS' ^ \A,(x)f\A,[y)^ , 

so that the contribution of ss links to the 7/*""*"^' is given by 

(\ - qf-is.F^''\As{x),As{y)). 

If the middle link of the motif connects one s and one o nodes (proba q(l — g)), the link is presented with probability 
7so, and we have to substitute 

xS' ^ \A,(^x)f\A,{y)\' 

for external links plus one new link sn which gives the factor i&sn + '^snx). By itself this link can create a new motif whose 
consecutive substitution is 

xS' - \A,{xt\A^{y)t 
Therefore, the contribution of these two kinds of motifs is 

q(l-g)7,o(<5,„ +7,na:)iy("'(A(a:),Ao(y)) +g(l-g)7sn(<5so + 7=o2;)H^"^(A(x),A„(y)), 

and the contribution from motifs with the middle link os is just obtained through the permutation x^y 

g(l - g)7,o(<5sn + 7=n2/)-ff'"'(A<,(i), A,(y)) + g(l - g)7,„(5,o + 7,„y)//(")(yl„(a;), A,(y)). 

Finally, motifs with the middle 00 link can create 3 new motifs whose common contribution is obtained the same way as 
above 

g^7oo(<5on + lo-nX)(&o-n + 7onJ/) -H''"'' (^o (a;) , (?/) ) +g^7o»i(5oo + looX){&n-n + 7nnJ/)-H''"'' (^o (a:) , A„ (?/) ) + 

+g^7on(<5nn + 7nna;)(5oo + 7ooy) -ff'"^ (yl„ (a;) , Ao (j/) ) + g^7„„(<5o„ + ^o-nx){&o-n + 7ont/)//^"'(yl„(a;),yl„(t/)) 

By consequence, when collecting all this contributions we get a recurrence relation on the generating function _ff 

H^"^^\x,y) = (1 - g)S,,F(")(A,(:c), A,(y)) + (72) 
+g(l - g)7.o(5.„ + 7.„a:)i/'">(A,(a:), Ao{y)) + g(l - g)7.„i?'">(A(x), yl„(y)) + (x ^ y) + 

+ q^7oo(<5on + 7ona:)(5o„ + '^ony)Fl^'^\Ao{x),Ao{y)) + q^on{&oo + looX)(&n-a + 7nn 2/)-^ ( ( x) , A„ (y ) ) + 
+ g^7on((5nn + lnnX){&oo + ^ooy)Fl^'^\An(x) , Ao{y)) + q"irm(&oVL + "1onX)(&on + 7onJ/) -ff'"^ ( A„ (a;) , ^„ ( J/) ) 
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This relation preserves explicitly the symmetry with respect to a; <-> y. 

The recurrence relation on T^"^ is derived using the same arguments as above. We remark first that a triangle already 
presented in the graph can generate at most 7 new triangles, or more precisely no new triangle if it has 3s nodes, one new 
triangle if it has lo/2s nodes, up to 3 new triangles for 2o/ls nodes, and at most 7 new triangles when it consists of 3o 
nodes. As previously, for external links of the motif we just have to replace x, y or z hy the respective functions As, Ao or 
An- The contribution of 3s triangles is 

{l-qfllT^''\As{x),As{y),As{z)), 

the contribution of lo/2s triangles 

g(l - qf1L'yss{Ssn+■ysuy){Ssn+lsr^z)T'•'^\Ao(x),As(y),As{z)) + (73) 

+ q{l~qf'yLlss{5so+isoy){5so + -ysoz)T'^''\Ar,{x),As(y),As{z)) + [x->y^z) (74) 

where the last term stands for 4 terms obtained by circular permutations of 3 variables. The contribution of 2o/ls triangle 
will contain 4 terms plus 8 terms resulting from circular permutations of variables 

q^{l~ q)^soloo{Ssn + "/snxf{Son + ')ony){5on + 'yonZ)T''"'\As[x) , Ao{y) , Ao{z)) + 

g^(l - q)7so7on7sn(<5s7i + 'ys-nX){5so + 'ysoX){5oo + Too?/) ((5nn + 7„„ 2)T^"' ( As (x) , Ao (y) , A„ (z) ) + 
q^{l — g)7so7on7sn(<5sn + ')snX){5so + ')soX){5„„ + 7„„J/)(5oo + ')ooZ)T''"\As{x) , A„{y) , Ao{z)) + 
q^{l~ qhsnJnn{Sso + 'ysoxf{5on + 7ony)((5on + 7o,iZ)r'"' ( As (a:) , An{y) , An{z)) + 

+{x-^y^z). 

(75) 

The contribution of 3o triangles contains 8 terms 

'?^7oo(5on + lonxf{5on + lony)'^ {5on + 7on z) ^T^"' ( Ao (x) , Ao (y ) , Ao{z)) + 
'?^7nn('5on + lonX)^{5on + "yonyf {Son + 7on z) ^T'"'' ( A„ (x) , A„ (j/) , A„ (z) ) + 

<7^7oo7on(5nn + Inn.x)'^ {5oo + "/ooy)'^{Soo + ^ooZ)^T^"\A„{x) , Ao{y) , Ao{z)) + {x ~* y z) + 
q^lnnloniSoo + 7ooa;)^(5„n + 'Jnny)'^ {S„n + 'Jnn z)'^T^"\Ao{x) , An{y) , A„{z)) + {x ^ y ^ z) . 

When getting all these contributions together, the full recurrence relation on T^"^ is obtained. 

The mean number of triangles is evaluated from this relation by setting all variables to one, or directly when applying 



previous arguments to triangles irrespective of their external connectivities 

= [(l_g)3^3^+g(l_g)2^^^(^2^ + 3^2j+ (76) 

+ g^(l - g)(7oo7so + 37„„7s„ + 67so7sn7on) + (77) 

+ '7''(7L + 37007L + 37nn7on + 7L] (r*"'')- (78) 



It evidently presents an exponential growth, that is common for many extensive quantities related to the graph dynamics. 
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